This report explores a dataset[1] containing 4898 white wines with 11 physicochemical variables (inputs) and 1 sensory variable (output). The inputs include objective tests (e.g. pH values) and the output is the median of at least 3 evaluations made by wine experts. Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

Univariate Plots Section

## [1] 4898   12
## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

The dataset consists of 12 numerical variables, with 4898 observations.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5
## 
## poor   ok good 
##  183 3655 1060

The distribution of quality looks roughly “normal”. Not surprisingly, the experts gave most wines an OK-but-mediocre score (5 or 6), with only a handful rated excellent (9) or poor (3).

The wines are grouped into three buckets by score. The counts above match poor (3 to 4), ok (5 to 6) and good (7 to 9).
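
The bucketing can be sketched in base R as follows. This is a minimal illustration, not necessarily the author's code: the quality vector is rebuilt from the frequency table printed above, the cut points are inferred from the bucket counts, and the name `quality.level` is hypothetical.

```r
# Rebuild the quality scores from the frequency table printed above.
scores <- 3:9
counts <- c(20, 163, 1457, 2198, 880, 175, 5)
quality <- rep(scores, times = counts)

# Assumed cut points: (2,4] -> poor, (4,6] -> ok, (6,9] -> good.
# `quality.level` is a hypothetical name, not necessarily the author's.
quality.level <- cut(quality,
                     breaks = c(2, 4, 6, 9),
                     labels = c("poor", "ok", "good"))

bucket_counts <- table(quality.level)
print(bucket_counts)
```

With these cut points the bucket sizes reproduce the poor/ok/good counts shown in the output above.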

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

I set the binwidth smaller than the default to take a closer, if noisier, look at the data. The fixed.acidity of most wines falls around 6.75, with a few outliers to the right. The volatile.acidity is skewed to the right, with most wines around 0.27. The citric.acid also has a few outliers to the right, and an interesting distribution when the binwidth is set below 0.05 (in the plot, it’s set to 0.02): there are two peaks, at about 0.3 and 0.47.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The residual.sugar distribution is right-skewed, and the max value (65.8) is far greater than the rest of the observations. The highest values are trimmed in the second plot so more detail is revealed. The most common residual sugar level is around 1.25, although the median is 5.2. The log-transformed sugar distribution in the third plot appears bimodal, with peaks around 1.25 and 10.
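
To see why a log transform helps here, the sketch below applies `log10` to the five-number summary printed above. The base of the logarithm is an assumption (the report does not state which one was used), and only the summary values are used, not the raw data.

```r
# Five-number summary of residual.sugar, copied from the output above.
sugar_summary <- c(min = 0.6, q1 = 1.7, median = 5.2, q3 = 9.9, max = 65.8)

# On the raw scale the maximum sits far from the bulk of the data.
raw_span <- sugar_summary["max"] - sugar_summary["q3"]   # 55.9
raw_iqr  <- sugar_summary["q3"] - sugar_summary["q1"]    # 8.2

# log10 (assumed base) compresses the long right tail,
# so the bulk of the data gets more plotting room.
log_summary <- log10(sugar_summary)
log_span <- log_summary["max"] - log_summary["q3"]
log_iqr  <- log_summary["q3"] - log_summary["q1"]

print(round(log_summary, 3))
print(c(raw_ratio = unname(raw_span / raw_iqr),
        log_ratio = unname(log_span / log_iqr)))
```

On the raw scale the upper tail is several times wider than the interquartile range, while on the log scale the two are of similar size, which is what lets the two peaks show up.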

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Despite the outliers at the far right end, the distribution of chlorides also looks almost “normal”. Most wines contain chlorides between 0.025 and 0.0625. The mean is 0.046 and the median is 0.043.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

##     90% 
## 0.99815

Since the density of white wines is very close to the density of water (1.000 g/mL at 3.98 °C[2]), I set the binwidth particularly small (0.0005) to get more detail. The plot is a bit skewed. Most wines are “lighter” than water: the 3rd quartile is 0.9961 and the 90th percentile is 0.99815.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

As we know, most wines are acidic, and the plot agrees with this domain knowledge. In this dataset, most samples are between 3 and 3.6 on the pH scale. The median and the mean fall at almost the same number, around 3.18. The distribution is mostly symmetric.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

The plot is skewed, but not dramatically. The mode appears around 9.3, which is lower than the 1st quartile (9.5).

## 
##  light medium  heavy 
##   4460    397     41

## wqw$alcohol.level: light
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     5.0     6.0     5.8     6.0     9.0 
## -------------------------------------------------------- 
## wqw$alcohol.level: medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   6.000   7.000   6.668   7.000   9.000 
## -------------------------------------------------------- 
## wqw$alcohol.level: heavy
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   6.000   7.000   6.659   7.000   8.000

I added a new variable, alcohol.level, by categorizing wines into “light”, “medium” and “heavy” groups according to their percent alcohol content.

Light-bodied wines (4460) far outnumber full-bodied wines (41).

From the histogram, we notice that the mode of both full-bodied and medium-bodied wines is 7, while that of light-bodied wines is 6. Looking at the summary, medium/full-bodied wines score better on average than light-bodied ones (means of about 6.7 versus 5.8). However, the best wines in the dataset (score 9) are light/medium-bodied.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

All of the statistics of total.sulfur.dioxide are greater than those of free.sulfur.dioxide, which makes sense since the free form is one part of the total. The free.sulfur.dioxide variable also has a few high outliers, so I trimmed them and zoomed in.

After zooming in, the free.sulfur.dioxide plot looks quite similar to the one for total.sulfur.dioxide, so I speculate that these two variables are highly correlated.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    78.0   100.0   103.1   125.0   331.0

I wonder how the bound forms of \(SO_2\) exist in the wine, so a new variable is created by subtracting free.sulfur.dioxide from total.sulfur.dioxide. I’m also interested in how this variable relates to the free form \(SO_2\).
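
The derived variable is a simple column difference. A minimal sketch, using only the first four rows of the two sulfur columns as they appear in the `str()` output at the top of the report (the data frame name `wqw_head` is hypothetical):

```r
# First four rows of the two sulfur columns, copied from the str() output.
wqw_head <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186)
)

# Bound SO2 = total SO2 - free SO2, computed row by row.
wqw_head$bound.sulfur.dioxide <-
  wqw_head$total.sulfur.dioxide - wqw_head$free.sulfur.dioxide

print(wqw_head)
```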

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Potassium sulphate is a wine additive that contributes to sulfur dioxide gas levels. Most values are below 1.0. The mode appears around 0.46.

Univariate Analysis

What is the structure of your dataset?

There are 4898 observations in the dataset with 12 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, density, pH, alcohol, free.sulfur.dioxide, total.sulfur.dioxide, sulphates and quality). All the variables are numerical. 11 of them are physicochemical measurements from objective tests. quality is based on sensory data from wine experts.

  • The most common score is 6. The best wines are scored 9, and there are 5 of them.
  • The median alcohol by volume is 10.4%.
  • There are many more light-bodied wines than full-bodied wines.
  • Full-bodied wines are generally better than light-bodied ones.
  • The wines are acidic, with pH ranging from 2.72 to 3.82.
  • At least 90% of the wines are lighter than water.

What is/are the main feature(s) of interest in your dataset?

The main features intriguing me are the quality and alcohol variables. I’d like to investigate which chemical properties influence the wine taste. There are of course other variables playing supportive roles.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Acidity, sugar, chlorides, density and \(SO_2\) are other features I’ll take into account.

Did you create any new variables from existing variables in the dataset?

I created a new variable by assigning the quality values to a 3-level (“good”, “ok”, “poor”) factor variable.

A similar categorization was applied to the alcohol variable: depending on the alcohol content, the observations were divided into “light”, “medium” and “heavy” groups.

A variable for the bound form of \(SO_2\) was created by subtracting the amount of free form \(SO_2\) from the total amount. I’m interested in how this variable correlates with the free form \(SO_2\), and how it relates to the wine quality.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I trimmed a few high outliers for residual.sugar, chlorides and free.sulfur.dioxide to zoom in on the majority of the data.

I also log-transformed the right-skewed residual.sugar distribution. The transformed distribution appeared bimodal.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity  citric.acid
## fixed.acidity           1.00000000      -0.02269729  0.289180698
## volatile.acidity       -0.02269729       1.00000000 -0.149471811
## citric.acid             0.28918070      -0.14947181  1.000000000
## residual.sugar          0.08902070       0.06428606  0.094211624
## chlorides               0.02308564       0.07051157  0.114364448
## free.sulfur.dioxide    -0.04939586      -0.09701194  0.094077221
## total.sulfur.dioxide    0.09106976       0.08926050  0.121130798
## density                 0.26533101       0.02711385  0.149502571
## pH                     -0.42585829      -0.03191537 -0.163748211
## sulphates              -0.01714299      -0.03572815  0.062330940
## alcohol                -0.12088112       0.06771794 -0.075728730
## quality                -0.11366283      -0.19472297 -0.009209091
## bound.sulfur.dioxide    0.13566071       0.15676923  0.102179337
##                      residual.sugar   chlorides free.sulfur.dioxide
## fixed.acidity            0.08902070  0.02308564       -0.0493958591
## volatile.acidity         0.06428606  0.07051157       -0.0970119393
## citric.acid              0.09421162  0.11436445        0.0940772210
## residual.sugar           1.00000000  0.08868454        0.2990983537
## chlorides                0.08868454  1.00000000        0.1013923521
## free.sulfur.dioxide      0.29909835  0.10139235        1.0000000000
## total.sulfur.dioxide     0.40143931  0.19891030        0.6155009650
## density                  0.83896645  0.25721132        0.2942104109
## pH                      -0.19413345 -0.09043946       -0.0006177961
## sulphates               -0.02666437  0.01676288        0.0592172458
## alcohol                 -0.45063122 -0.36018871       -0.2501039415
## quality                 -0.09757683 -0.20993441        0.0081580671
## bound.sulfur.dioxide     0.34484449  0.19379550        0.2635372837
##                      total.sulfur.dioxide     density            pH
## fixed.acidity                 0.091069756  0.26533101 -0.4258582910
## volatile.acidity              0.089260504  0.02711385 -0.0319153683
## citric.acid                   0.121130798  0.14950257 -0.1637482114
## residual.sugar                0.401439311  0.83896645 -0.1941334540
## chlorides                     0.198910300  0.25721132 -0.0904394560
## free.sulfur.dioxide           0.615500965  0.29421041 -0.0006177961
## total.sulfur.dioxide          1.000000000  0.52988132  0.0023209718
## density                       0.529881324  1.00000000 -0.0935914935
## pH                            0.002320972 -0.09359149  1.0000000000
## sulphates                     0.134562367  0.07449315  0.1559514973
## alcohol                      -0.448892102 -0.78013762  0.1214320987
## quality                      -0.174737218 -0.30712331  0.0994272457
## bound.sulfur.dioxide          0.922482350  0.50444690  0.0031433874
##                        sulphates     alcohol      quality
## fixed.acidity        -0.01714299 -0.12088112 -0.113662831
## volatile.acidity     -0.03572815  0.06771794 -0.194722969
## citric.acid           0.06233094 -0.07572873 -0.009209091
## residual.sugar       -0.02666437 -0.45063122 -0.097576829
## chlorides             0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide   0.05921725 -0.25010394  0.008158067
## total.sulfur.dioxide  0.13456237 -0.44889210 -0.174737218
## density               0.07449315 -0.78013762 -0.307123313
## pH                    0.15595150  0.12143210  0.099427246
## sulphates             1.00000000 -0.01743277  0.053677877
## alcohol              -0.01743277  1.00000000  0.435574715
## quality               0.05367788  0.43557472  1.000000000
## bound.sulfur.dioxide  0.13569394 -0.42692304 -0.217867760
##                      bound.sulfur.dioxide
## fixed.acidity                 0.135660713
## volatile.acidity              0.156769227
## citric.acid                   0.102179337
## residual.sugar                0.344844495
## chlorides                     0.193795498
## free.sulfur.dioxide           0.263537284
## total.sulfur.dioxide          0.922482350
## density                       0.504446902
## pH                            0.003143387
## sulphates                     0.135693943
## alcohol                      -0.426923036
## quality                      -0.217867760
## bound.sulfur.dioxide          1.000000000

No variable is strongly correlated with quality. Alcohol has the largest correlation with quality, and it is only moderate (r = 0.44). Alcohol also has a strong negative correlation with density (r = -0.78). This makes sense since density is affected by both sugar and ethanol: ethanol is “lighter” (0.789 g/cm³) than water, so more alcohol leads to lower density.
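
A correlation matrix like the one above can be produced with base R's `cor()`. The sketch below runs it on only the first ten rows of three columns, copied from the `str()` output at the top of the report, so the coefficients it prints will differ from the full-data values. The data frame name `wqw_head` is hypothetical.

```r
# First ten rows of three columns, copied from the str() output.
wqw_head <- data.frame(
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  pH             = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22)
)

# Pairwise Pearson correlations between all columns.
r <- cor(wqw_head, method = "pearson")
print(round(r, 2))
```

Even on this tiny sample the alcohol and residual.sugar correlation comes out negative, in line with the sign in the full matrix.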

## wqw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.35   11.00   12.60 
## -------------------------------------------------------- 
## wqw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## -------------------------------------------------------- 
## wqw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## wqw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## wqw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## wqw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## -------------------------------------------------------- 
## wqw$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90

The first plot shows 7 vertical strips. Transparency, jitter and a conditional mean of alcohol are added to deal with the overplotting. The second plot gives us a vague trend. The third figure is a box plot using the new categorical variable, which shows a clearer relationship. Overall, the highest alcohol content goes with the highest quality, while the lowest-alcohol wines are mostly of mediocre quality.
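
The conditional means can be sanity-checked against the overall mean. The sketch below uses only numbers printed in this report: the per-score counts from the quality table and the per-score mean alcohol from the summaries above. Because the group means are rounded, the weighted mean is only expected to land close to the overall mean of 10.51.

```r
# Per-score counts and mean alcohol, copied from the outputs in this report.
quality_scores <- 3:9
n_wines        <- c(20, 163, 1457, 2198, 880, 175, 5)
mean_alcohol   <- c(10.35, 10.15, 9.809, 10.58, 11.37, 11.64, 12.18)

# Weighting each group mean by its group size recovers the overall mean.
overall_mean <- weighted.mean(mean_alcohol, w = n_wines)
print(round(overall_mean, 2))

# From score 5 upward, each group's mean alcohol exceeds the previous one.
rising_from_5 <- all(diff(mean_alcohol[quality_scores >= 5]) > 0)
print(rising_from_5)
```

The dip at score 5 followed by a steady rise is the pattern the box plot makes visible.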

## wqw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.200   6.575   7.300   7.600   8.525  11.800 
## -------------------------------------------------------- 
## wqw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.800   6.400   6.900   7.129   7.600  10.200 
## -------------------------------------------------------- 
## wqw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.500   6.400   6.800   6.934   7.400  10.300 
## -------------------------------------------------------- 
## wqw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.838   7.300  14.200 
## -------------------------------------------------------- 
## wqw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.200   6.200   6.700   6.735   7.200   9.200 
## -------------------------------------------------------- 
## wqw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.900   6.200   6.800   6.657   7.300   8.200 
## -------------------------------------------------------- 
## wqw$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.60    6.90    7.10    7.42    7.40    9.10

Wines scored 9 have a slightly higher median and mean fixed acidity than those scored 4 to 8, but the wines scored 3 are higher still. Neither the scatter plot nor the box plot gives us a compelling trend.

## wqw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1700  0.2375  0.2600  0.3332  0.4125  0.6400 
## -------------------------------------------------------- 
## wqw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1100  0.2700  0.3200  0.3812  0.4600  1.1000 
## -------------------------------------------------------- 
## wqw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.240   0.280   0.302   0.340   0.905 
## -------------------------------------------------------- 
## wqw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2000  0.2500  0.2606  0.3000  0.9650 
## -------------------------------------------------------- 
## wqw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.1900  0.2500  0.2628  0.3200  0.7600 
## -------------------------------------------------------- 
## wqw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.2000  0.2600  0.2774  0.3300  0.6600 
## -------------------------------------------------------- 
## wqw$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.240   0.260   0.270   0.298   0.360   0.360

Too much acetic acid in wine can lead to an unpleasant, vinegary taste, so I expected this feature to have an effect. The poor-quality wines (scores 3 and 4) do have higher mean levels. But the best-quality wines (score 9) don’t have the lowest level of acetic acid; the lowest means belong to the wines scored 6 and 7.

## wqw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2100  0.2575  0.3450  0.3360  0.3850  0.4700 
## -------------------------------------------------------- 
## wqw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1900  0.2900  0.3042  0.4000  0.8800 
## -------------------------------------------------------- 
## wqw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2400  0.3200  0.3377  0.4100  1.0000 
## -------------------------------------------------------- 
## wqw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.270   0.320   0.338   0.380   1.660 
## -------------------------------------------------------- 
## wqw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.2800  0.3100  0.3256  0.3600  0.7400 
## -------------------------------------------------------- 
## wqw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0400  0.2800  0.3200  0.3265  0.3600  0.7400 
## -------------------------------------------------------- 
## wqw$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.290   0.340   0.360   0.386   0.450   0.490

The best-quality wines (score 9) have the highest median and mean levels of citric acid, which adds “freshness” and a pleasant flavor to wines. The second-poorest group (score 4) has the lowest median level of citric acid. But there isn’t much variation among the rest of the wines.

## wqw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.587   4.600   6.393  10.700  16.200 
## -------------------------------------------------------- 
## wqw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.300   2.500   4.628   7.100  17.550 
## -------------------------------------------------------- 
## wqw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   7.000   7.335  11.500  23.500 
## -------------------------------------------------------- 
## wqw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.700   5.300   6.442   9.900  65.800 
## -------------------------------------------------------- 
## wqw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.700   3.650   5.186   7.325  19.250 
## -------------------------------------------------------- 
## wqw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.800   2.100   4.300   5.671   8.200  14.800 
## -------------------------------------------------------- 
## wqw$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.60    2.00    2.20    4.12    4.20   10.60

The median sugar content jumps up and down across the quality levels. Most of the points cram at the bottom of the plot. There isn’t a clear trend relating residual sugar to quality.

## wqw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400 
## -------------------------------------------------------- 
## wqw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0130  0.0380  0.0460  0.0501  0.0540  0.2900 
## -------------------------------------------------------- 
## wqw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600 
## -------------------------------------------------------- 
## wqw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500 
## -------------------------------------------------------- 
## wqw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500 
## -------------------------------------------------------- 
## wqw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100 
## -------------------------------------------------------- 
## wqw$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0180  0.0210  0.0310  0.0274  0.0320  0.0350

The chlorides variable contains a number of outliers as well, so I added a coord_cartesian layer to zoom in past them. It turns out that, from score 5 upward, the lower the chlorides, the better the quality tends to be.

## [1] 0.9224823
## [1] 0.615501
## [1] 0.2635373

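The first plot shows that total.sulfur.dioxide and bound.sulfur.dioxide are strongly linearly correlated (r = 0.92). The free.sulfur.dioxide and total.sulfur.dioxide variables have a weaker linear relationship (r = 0.62). The third plot, free.sulfur.dioxide against bound.sulfur.dioxide, doesn’t show a strong relationship (r = 0.26).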

## wqw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    14.0    82.5   106.0   117.3   152.2   331.0 
## -------------------------------------------------------- 
## wqw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     6.0    67.5   102.0   101.9   133.8   195.0 
## -------------------------------------------------------- 
## wqw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    91.0   114.0   114.5   137.0   293.5 
## -------------------------------------------------------- 
## wqw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    15.0    76.0    97.0   101.4   123.0   243.0 
## -------------------------------------------------------- 
## wqw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.00   71.00   86.00   90.99  106.00  199.00 
## -------------------------------------------------------- 
## wqw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   42.00   71.00   84.00   89.45  104.50  159.50 
## -------------------------------------------------------- 
## wqw$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    61.0    62.0    82.0    82.6    96.0   112.0

Since the correlation between bound.sulfur.dioxide and quality is the strongest of the three sulfur variables (r = -0.22), I only look into the plots for this pair. As with chlorides, the less bound-form \(SO_2\) there is, the better the quality tends to be.

## wqw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9911  0.9925  0.9944  0.9949  0.9969  1.0001 
## -------------------------------------------------------- 
## wqw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9892  0.9926  0.9941  0.9943  0.9958  1.0004 
## -------------------------------------------------------- 
## wqw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9933  0.9953  0.9953  0.9972  1.0024 
## -------------------------------------------------------- 
## wqw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9876  0.9917  0.9937  0.9940  0.9959  1.0390 
## -------------------------------------------------------- 
## wqw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9906  0.9918  0.9925  0.9937  1.0004 
## -------------------------------------------------------- 
## wqw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9903  0.9916  0.9922  0.9935  1.0006 
## -------------------------------------------------------- 
## wqw$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9897  0.9898  0.9903  0.9915  0.9906  0.9970

Density is negatively correlated with alcohol. Not surprisingly, the best-quality wines have the lowest density. But the highest density isn’t associated with the worst quality.

## wqw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.870   3.035   3.215   3.188   3.325   3.550 
## -------------------------------------------------------- 
## wqw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.830   3.070   3.160   3.183   3.280   3.720 
## -------------------------------------------------------- 
## wqw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.790   3.080   3.160   3.169   3.240   3.790 
## -------------------------------------------------------- 
## wqw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.080   3.180   3.189   3.280   3.810 
## -------------------------------------------------------- 
## wqw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.840   3.100   3.200   3.214   3.320   3.820 
## -------------------------------------------------------- 
## wqw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.940   3.120   3.230   3.219   3.330   3.590 
## -------------------------------------------------------- 
## wqw$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.200   3.280   3.280   3.308   3.370   3.410

The correlation between pH and quality is weak (r = 0.10). But according to the plot, better wines tend to be slightly less acidic overall.

The two plots show how sugar and alcohol affect density. With more residual sugar and less alcohol, the density goes higher.

The density also has a moderate positive relationship with the bound form of sulfur dioxide (r = 0.50).

The relationship between alcohol and sugar is vague: it is negative (r = -0.45), but not strong.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  • The calculated Pearson r indicates that no single feature is strongly correlated with quality. However, among all the features, alcohol has the largest correlation with quality, and it is positive (r = 0.44).
  • Density (r = -0.31), bound sulfur dioxide (r = -0.22) and chlorides (r = -0.21) have negative relationships with quality.
  • pH has a very weak positive relationship with quality (r = 0.10).

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  • Alcohol has a strong negative relationship with density (r = -0.78).
  • Density is strongly related to residual sugar (r = 0.84) and alcohol, and moderately positively related to total sulfur dioxide (r = 0.53).

What was the strongest relationship you found?

  • The strongest relationship is between bound.sulfur.dioxide and total.sulfur.dioxide (r = 0.92), which is consistent with most of the sulfur dioxide being in the bound form (a mean of 103.1 out of a total mean of 138.4).
  • Density and alcohol also have a strong relationship (r = -0.78) compared to the other features.
  • Of all the features, alcohol has the strongest relationship with quality.
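
The claim that most sulfur dioxide is in the bound form can be checked from the means reported in the univariate summaries. This is simple arithmetic on those printed values, not a computation on the raw data:

```r
# Means copied from the summary() outputs earlier in the report.
mean_free  <- 35.31
mean_total <- 138.4
mean_bound <- 103.1

# Average share of total SO2 that is in the bound form.
bound_share <- mean_bound / mean_total
print(round(bound_share, 3))

# The free and bound means should add up to the total mean, up to rounding.
print(abs((mean_free + mean_bound) - mean_total) < 0.05)
```

On average roughly three quarters of the total is bound, which fits the high correlation between the two variables.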

Multivariate Plots Section

The distribution of alcohol (%) is faceted by quality and colored by alcohol level. Medium/full-bodied wines mostly fall in the higher quality groups. Each distribution is either skewed or lacks enough observations. Interestingly, the mode gradually moves from left to right across the facets (right-skewed to left-skewed).

In addition to the findings from the last plot, we can see that not only the mode but the whole distribution shifts from the right to the left.

This nebulous plot depicts the relationship between chlorides and alcohol, colored by quality. Despite a few high-quality wines in the bottom right corner, the plot divides into a top left part with lots of purple/red dots and a bottom right part with yellow dots, although they overlap in the top left corner as well. High-quality wines are mostly high in alcohol and low in chlorides, but the converse does not hold.

This plot depicts a few interesting relationships. It’s faceted by alcohol level, with density along the y axis. The three clusters move downward from the first facet to the third, as the density of each cluster gets lower. Also, the pH values of the third cluster are more concentrated around 3.3, while the other two spread out a lot and are centered below 3.3.

High alcohol (%) comes with low chlorides and low residual sugar.

This is a similar plot with a different y axis, but it conveys more information. Sugar and alcohol both contribute to the density: more sugar makes the liquid denser, while alcohol pulls the density down.

## # weights:  49 (36 variable)
## initial  value 9531.067910 
## iter  10 value 6936.249193
## iter  20 value 5953.931769
## iter  30 value 5651.788336
## iter  40 value 5644.772540
## iter  50 value 5642.097494
## iter  60 value 5639.352159
## iter  70 value 5632.208596
## iter  80 value 5625.493185
## iter  90 value 5624.923045
## iter 100 value 5622.909941
## final  value 5622.909941 
## stopped after 100 iterations
## Call:
## multinom(formula = factor(quality) ~ alcohol + chlorides + residual.sugar + 
##     density + pH, data = wqw)
## 
## Coefficients:
##   (Intercept)    alcohol  chlorides residual.sugar    density         pH
## 4  -120.81014 -0.4128723  -11.15200    -0.18646306  132.44834 -0.9107893
## 5   -94.53182 -0.6217789  -11.22721    -0.07452422  109.17594 -0.7565946
## 6    66.00632  0.0018524  -11.88587     0.03593580  -61.66156  0.1050599
## 7   112.85741  0.4084028  -30.63426     0.06898806 -117.16284  1.1912225
## 8    69.44586  0.7429319  -21.97037     0.10047706  -80.46100  1.5105167
## 9   -13.76324  0.8293298 -155.80289     0.07316911  -11.04498  5.8738811
## 
## Std. Errors:
##   (Intercept)   alcohol chlorides residual.sugar  density       pH
## 4    2.997051 0.2506668 5.8700800     0.05469809 2.990051 1.617535
## 5    2.845250 0.2353793 5.1183617     0.05096326 2.844434 1.541327
## 6    2.827615 0.2333035 5.1462789     0.05085075 2.843158 1.536579
## 7    2.879589 0.2352787 6.4461695     0.05145795 2.846302 1.549026
## 8    3.027010 0.2439436 8.8810602     0.05322446 2.958503 1.615454
## 9    6.608243 0.5258974 0.3584722     0.14876669 6.617603 3.316591
## 
## Residual Deviance: 11245.82 
## AIC: 11317.82

A multinomial logit regression is fitted on several variables. Quality score 3 is the reference group, so the coefficients in each row are log-odds relative to quality 3. Based on the coefficients, alcohol and pH play a more positive role as quality increases, while density and chlorides act negatively. The residual sugar coefficients change little across the levels.
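
One way to judge how much weight these coefficients can bear is a Wald z-test, dividing each coefficient by its standard error. This is a sketch of a standard approach rather than a step from the original analysis. The numbers below are the alcohol column copied from the model summary above. Because the optimizer log shows the fit stopped at the 100-iteration limit, the resulting p-values should be read as rough guidance only.

```r
# Alcohol coefficients and standard errors, copied from the multinom() summary.
quality_level <- c(4, 5, 6, 7, 8, 9)
alcohol_coef  <- c(-0.4128723, -0.6217789, 0.0018524,
                    0.4084028,  0.7429319, 0.8293298)
alcohol_se    <- c(0.2506668, 0.2353793, 0.2333035,
                   0.2352787, 0.2439436, 0.5258974)

# Wald z statistic and two-sided p-value under a normal approximation.
z <- alcohol_coef / alcohol_se
p <- 2 * pnorm(-abs(z))

# exp(coef) is the relative risk ratio versus quality 3
# for a one-unit (1%) increase in alcohol.
rrr <- exp(alcohol_coef)

result <- data.frame(
  quality_level,
  coef = alcohol_coef,
  z    = round(z, 2),
  p    = round(p, 4),
  rrr  = round(rrr, 3)
)
print(result)
```

On these numbers, only the alcohol coefficients for quality 5 and quality 8 fall below the conventional 5% threshold, so the upward trend in the alcohol coefficients is suggestive rather than firmly established at every level.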

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Alcohol combined with chlorides gives a clearer picture of quality.

Were there any interesting or surprising interactions between features?

It is not surprising that higher alcohol goes with lower density, but it is a bit surprising that higher alcohol also comes with lower chlorides and lower sugar. Explaining this probably requires some knowledge of chemistry and winemaking technology.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes. I fitted a multinomial logit regression to compute the coefficients. The major limitation is that many of the variables are correlated. However, the other features have so little influence that I kept the correlated ones anyway. The strength of the model is that I can see how the coefficients change across the quality levels.
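
A common way to quantify how strongly correlated predictors are is the variance inflation factor (VIF), which for each predictor equals 1 / (1 - R²) from regressing it on the others. The report did not compute this. The following is a minimal base R sketch on synthetic data, where vif_base() is a hypothetical helper name and the simulated link between alcohol and density is a made-up illustration value.

```r
# Hypothetical helper: VIF of each column regressed on all other columns.
vif_base <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

set.seed(1)
n <- 300
alcohol <- runif(n, 8, 14)

# Synthetic predictors: density is built to depend on alcohol, pH is independent.
sim <- data.frame(
  alcohol = alcohol,
  density = 1.005 - 0.0012 * alcohol + rnorm(n, sd = 0.0005),
  pH      = rnorm(n, mean = 3.2, sd = 0.15)
)

vifs <- vif_base(sim)
print(round(vifs, 2))
```

A VIF far above 1 for a pair such as alcohol and density would indicate that their individual coefficients are hard to separate, which is the limitation described above.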


Final Plots and Summary

Plot One

Description One

There is an unexpected spike in addition to the mode. It could be the result of certain winemakers adding more citric acid than average as a supplement to boost the acidity.

Plot Two

Description Two

On average, lower chloride content goes with higher quality. Overall, the best white wines contain the lowest chlorides.
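
The per-group summary behind a statement like this can be sketched with aggregate(). The data below are synthetic and only illustrate the call: the downward trend of 0.004 per quality level and the noise level are made-up values, not estimates from the wine data.

```r
set.seed(7)
n <- 700

# Synthetic stand-in: quality scores 3..9, chlorides drifting down with quality.
sim <- data.frame(quality = sample(3:9, n, replace = TRUE))
sim$chlorides <- 0.060 - 0.004 * (sim$quality - 3) + rnorm(n, sd = 0.005)

# Median chlorides within each quality level, one row per level.
by_quality <- aggregate(chlorides ~ quality, data = sim, FUN = median)
names(by_quality)[2] <- "median.chlorides"
print(by_quality)
```

Running the same aggregate() call on the real data frame would give the medians that the plot summarizes.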

Plot Three

Description Three

Higher quality wines tend to have higher alcohol (%) and a lower amount of salt (chlorides).


Reflection

I would say alcohol and chlorides influence wine quality the most, although alcohol is correlated with density, so density plays a part in the taste as well. Surprisingly, the pH value also plays a role in determining the quality of white wines. Less acidic wines tend to have a nicer flavor.

The biggest struggle is that no single feature stands out and boldly answers “who’s in charge”. The output variable should really be treated as categorical, which differs from the course materials, so the data exploration path needed adjusting to cope with this type of data. I also had difficulty picking a reasonable model to complement the analysis.

This project also reminds me that background knowledge and common sense help the EDA process tremendously. The data does not speak for itself. It is the analyst interpreting the data who brings reality to the data and reflects the data back.


[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236. Available at: http://dx.doi.org/10.1016/j.dss.2009.05.016 (Elsevier); pre-press PDF: http://www3.dsi.uminho.pt/pcortez/winequality09.pdf; BibTeX: http://www3.dsi.uminho.pt/pcortez/dss09.bib

[2] https://www.sigmaaldrich.com/catalog/product/sial/denwat?lang=en&region=US